Missing data in single-cell sequencing datasets poses significant challenges for extracting meaningful biological insights. However, existing imputation approaches, which often assume uniformity and data completeness, struggle to address cases with large patches of missing data. In this paper, we present CROT, an optimal transport-based imputation algorithm designed to handle patch-based missing data in tabular formats. Our approach effectively captures the underlying data structure in the presence of significant missingness. Notably, it achieves superior imputation accuracy while significantly reducing runtime, demonstrating its scalability and efficiency for large-scale datasets. This work introduces a robust solution for imputation in heterogeneous, high-dimensional datasets with structured data absence, addressing critical challenges in both biological and clinical data analysis. Our code is available at Anomalous Github.
Recent advancements in single-cell multi-omics, particularly RNA-seq, have provided profound insights into cellular heterogeneity and gene regulation. While pre-trained language model (PLM) paradigm based single-cell foundation models have shown promise, they remain constrained by insufficient integration of in-depth individual profiles and neglecting the influence of noise within multi-modal data. To address both issues, we propose an Open-world Language Knowledge-Aided Robust Single-Cell Foundation Model (OKR-CELL). It is built based on a cross-modal Cell-Language pre-training framework, which comprises two key innovations: (1) leveraging Large Language Models (LLMs) based workflow with retrieval-augmented generation (RAG) enriches cell textual descriptions using open-world knowledge; (2) devising a Cross-modal Robust Alignment (CRA) objective that incorporates sample reliability assessment, curriculum learning, and coupled momentum contrastive learning to strengthen the model's resistance to noisy data. After pretraining on 32M cell-text pairs, OKR-CELL obtains cutting-edge results across 6 evaluation tasks. Beyond standard benchmarks such as cell clustering, cell-type annotation, batch-effect correction, and few-shot annotation, the model also demonstrates superior performance in broader multi-modal applications, including zero-shot cell-type annotation and bidirectional cell-text retrieval.
Clustering is a central tool in biomedical research for discovering heterogeneous patient subpopulations, where group boundaries are often diffuse rather than sharply separated. Traditional methods produce hard partitions, whereas soft clustering methods such as fuzzy $c$-means (FCM) allow mixed memberships and better capture uncertainty and gradual transitions. Despite the widespread use of FCM, principled statistical inference for fuzzy clustering remains limited. We develop a new framework for weighted fuzzy $c$-means (WFCM) for settings with potential cluster size imbalance. Cluster-specific weights rebalance the classical FCM criterion so that smaller clusters are not overwhelmed by dominant groups, and the weighted objective induces a normalized density model with scale parameter $σ$ and fuzziness parameter $m$. Estimation is performed via a blockwise majorize--minimize (MM) procedure that alternates closed-form membership and centroid updates with likelihood-based updates of $(σ,\bw)$. The intractable normalizing constant is approximated by importance sampling using a data-adaptive Gaussian mixture proposal. We further provide likelihood ratio tests for comparing cluster centers and bootstrap-based confidence intervals. We establish consistency and asymptotic normality of the maximum likelihood estimator, validate the method through simulations, and illustrate it using single-cell RNA-seq and Alzheimer disease Neuroimaging Initiative (ADNI) data. These applications demonstrate stable uncertainty quantification and biologically meaningful soft memberships, ranging from well-separated cell populations under imbalance to a graded AD versus non-AD continuum consistent with disease progression.
Unsupervised cell type identification is crucial for uncovering and characterizing heterogeneous populations in single cell omics studies. Although a range of clustering methods have been developed, most focus exclusively on intrinsic cellular structure and ignore the pivotal role of cell-gene associations, which limits their ability to distinguish closely related cell types. To this end, we propose a Refinement Contrastive Learning framework (scRCL) that explicitly incorporates cell-gene interactions to derive more informative representations. Specifically, we introduce two contrastive distribution alignment components that reveal reliable intrinsic cellular structures by effectively exploiting cell-cell structural relationships. Additionally, we develop a refinement module that integrates gene-correlation structure learning to enhance cell embeddings by capturing underlying cell-gene associations. This module strengthens connections between cells and their associated genes, refining the representation learning to exploiting biologically meaningful relationships. Extensive experiments on several single-cell RNA-seq and spatial transcriptomics benchmark datasets demonstrate that our method consistently outperforms state-of-the-art baselines in cell-type identification accuracy. Moreover, downstream biological analyses confirm that the recovered cell populations exhibit coherent gene-expression signatures, further validating the biological relevance of our approach. The code is available at https://github.com/THPengL/scRCL.
Surgical tumor resection aims to remove all cancer cells in the tumor margin and at centimeter-scale depths below the tissue surface. During surgery, microscopic clusters of disease are intraoperatively difficult to visualize and are often left behind, significantly increasing the risk of cancer recurrence. Radioguided surgery (RGS) has shown the ability to selectively tag cancer cells with gamma (γ) photon emitting radioisotopes to identify them, but require a mm-scale γ photon spectrometer to localize the position of these cells in the tissue margin (i.e., a function of incident γ photon energy) with high specificity. Here we present a 9.9 mm2 integrated circuit (IC)-based γ spectrometer implemented in 180 nm CMOS, to enable the measurement of single γ photons and their incident energy with sub-keV energy resolution. We use small 2 2 um reverse-biased diodes that have low depletion region capacitance, and therefore produce millivolt-scale voltage signals in response to the small charge generated by incident γ photons. A low-power energy spectrometry method is implemented by measuring the decay time it takes for the generated voltage signal to settle back to DC after a γ detection event, instead of measuring the voltage drop directly. This spectrometry method is implemented in three different pixel architectures that allow for configurable pixel sensitivity, energy-resolution, and energy dynamic range based on the widely heterogenous surgical and patient presentation in RGS. The spectrometer was tested with three common γ-emitting radioisotopes (64Cu, 133Ba, 177Lu), and is able to resolve activities down to 1 uCi with sub-keV energy resolution and 1.315 MeV energy dynamic range, using 5-minute acquisitions.
Single-cell data analysis seeks to characterize cellular heterogeneity based on high-dimensional gene expression profiles. Conventional approaches represent each cell as a vector in Euclidean space, which limits their ability to capture intrinsic correlations and multiscale geometric structures. We propose a multiscale framework based on Grassmann manifolds that integrates machine learning with subspace geometry for single-cell data analysis. By generating embeddings under multiple representation scales, the framework combines their features from different geometric views into a unified Grassmann manifold. A power-based scale sampling function is introduced to control the selection of scales and balance in- formation across resolutions. Experiments on nine benchmark single-cell RNA-seq datasets demonstrate that the proposed approach effectively preserves meaningful structures and provides stable clustering performance, particularly for small to medium-sized datasets. These results suggest that Grassmann manifolds offer a coherent and informative foundation for analyzing single cell data.
Single-cell RNA sequencing (scRNA-seq) enables the study of cellular heterogeneity. Yet, clustering accuracy, and with it downstream analyses based on cell labels, remain challenging due to measurement noise and biological variability. In standard latent spaces (e.g., obtained through PCA), data from different cell types can be projected close together, making accurate clustering difficult. We introduce a latent plug-and-play diffusion framework that separates the observation and denoising space. This separation is operationalized through a novel Gibbs sampling procedure: the learned diffusion prior is applied in a low-dimensional latent space to perform denoising, while to steer this process, noise is reintroduced into the original high-dimensional observation space. This unique "input-space steering" ensures the denoising trajectory remains faithful to the original data structure. Our approach offers three key advantages: (1) adaptive noise handling via a tunable balance between prior and observed data; (2) uncertainty quantification through principled uncertainty estimates for downstream analysis; and (3) generalizable denoising by leveraging clean reference data to denoise noisier datasets, and via averaging, improve quality beyond the training set. We evaluate robustness on both synthetic and real single-cell genomics data. Our method improves clustering accuracy on synthetic data across varied noise levels and dataset shifts. On real-world single-cell data, our method demonstrates improved biological coherence in the resulting cell clusters, with cluster boundaries that better align with known cell type markers and developmental trajectories.
Integrated Sensing and Communication (ISAC) is a key emerging 6G technology. Despite progress, ISAC still lacks scalable methods for joint AP clustering and user/target scheduling in distributed deployments under fronthaul limits. Moreover, existing ISAC solutions largely rely on centralized processing and full channel state information, limiting scalability. This paper addresses joint access point (AP) clustering, user and target scheduling, and AP mode selection in distributed cell-free ISAC systems operating with constrained fronthaul capacity. We formulate the problem as a mixed-integer linear program (MILP) that jointly captures interference coupling, RF-chain limits, and sensing requirements, providing optimal but computationally demanding solutions. To enable real-time and scalable operation, we propose ASSENT (ASSociation and ENTity selection), a graph neural network (GNN) framework trained on MILP solutions to efficiently learn association and mode-selection policies directly from lightweight link statistics. Simulations show that ASSENT achieves near-optimal utility while accurately learning the underlying associations. Additionally, its single forward pass inference reduces decision latency compared to optimization-based methods. An open-source Python/PyTorch implementation with full datasets is provided to facilitate reproducible and extensible research in cell-free ISAC.




Understanding causal heterogeneity is essential for scientific discovery in domains such as biology and medicine. However, existing methods lack causal awareness, with insufficient modeling of heterogeneity, confounding, and observational constraints, leading to poor interpretability and difficulty distinguishing true causal heterogeneity from spurious associations. We propose an unsupervised framework, HCL (Interpretable Causal Mechanism-Aware Clustering with Adaptive Heterogeneous Causal Structure Learning), that jointly infers latent clusters and their associated causal structures from mixed-type observational data without requiring temporal ordering, environment labels, interventions or other prior knowledge. HCL relaxes the homogeneity and sufficiency assumptions by introducing an equivalent representation that encodes both structural heterogeneity and confounding. It further develops a bi-directional iterative strategy to alternately refine causal clustering and structure learning, along with a self-supervised regularization that balance cross-cluster universality and specificity. Together, these components enable convergence toward interpretable, heterogeneous causal patterns. Theoretically, we show identifiability of heterogeneous causal structures under mild conditions. Empirically, HCL achieves superior performance in both clustering and structure learning tasks, and recovers biologically meaningful mechanisms in real-world single-cell perturbation data, demonstrating its utility for discovering interpretable, mechanism-level causal heterogeneity.




Single-cell RNA sequencing (scRNA-seq) reveals cell heterogeneity, with cell clustering playing a key role in identifying cell types and marker genes. Recent advances, especially graph neural networks (GNNs)-based methods, have significantly improved clustering performance. However, the analysis of scRNA-seq data remains challenging due to noise, sparsity, and high dimensionality. Compounding these challenges, GNNs often suffer from over-smoothing, limiting their ability to capture complex biological information. In response, we propose scSiameseClu, a novel Siamese Clustering framework for interpreting single-cell RNA-seq data, comprising of 3 key steps: (1) Dual Augmentation Module, which applies biologically informed perturbations to the gene expression matrix and cell graph relationships to enhance representation robustness; (2) Siamese Fusion Module, which combines cross-correlation refinement and adaptive information fusion to capture complex cellular relationships while mitigating over-smoothing; and (3) Optimal Transport Clustering, which utilizes Sinkhorn distance to efficiently align cluster assignments with predefined proportions while maintaining balance. Comprehensive evaluations on seven real-world datasets demonstrate that~\methodname~outperforms state-of-the-art methods in single-cell clustering, cell type annotation, and cell type classification, providing a powerful tool for scRNA-seq data interpretation.